ABSTRACT :

A verification system determines whether unknown input speech contains a
recognized keyword or consists of speech or other sounds that do not contain
any of the keywords. The verification system is designed to operate on the
subword level, so that the verification process is advantageously vocabulary
independent. Such a vocabulary-independent verifier is achieved by a two-stage
verification process comprising subword level verification followed by string
level verification. The subword level verification stage verifies each subword
segment in the input speech, as determined by a Hidden Markov Model (HMM)
recognizer, to determine if that segment consists of the sound corresponding to
the subword that the HMM recognizer assigned to that segment. The string level
verification stage combines the results of the subword level verification to
make the rejection decision for the whole keyword. Advantageously, the
training of this two-stage verifier is independent of the specific vocabulary
set, implying that when the vocabulary set is updated or changed the verifier
need not be retrained and can still reliably verify the new set of
keywords.

14 Claims, 10 Drawing figures 
Exemplary Claim Number: 1 
Number of Drawing Sheets: 5 



Detailed Description Text - DETX (32) : 

According to this invention the concept of "anti-subword class" HMM models
is used. The anti-subword class models are constructed by, first, clustering
the subword units into J classes, where J<<K. A hierarchical clustering
algorithm is used to cluster subwords based on minimizing the overall
inter-subword cluster confusion rate. Subwords that are highly confusable with
other subwords are likely to be clustered together. Given the subword model
set, {s.sub.i}, the subword confusion information is obtained from the subword
confusion matrix of the training set of sample sentences. Setting J=6, the
constituency of each class is given in Table 5, which shows, for example, that
certain vowel sounds are clustered into one class (Class B) while nasals are
included under Class E. For each subword class, an anti-subword class model is
trained using all speech segments corresponding to sounds that are not modeled
by any of the subwords in that subword class. Based on the classes shown in
Table 5, a total of 6 anti-subword class models are constructed. The
anti-subword model, s.sub.j, is now considered to be the anti-subword class
model corresponding to the class to which s.sub.j belongs.
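
The clustering step described above can be illustrated with a minimal sketch. It assumes a simple greedy agglomerative procedure that repeatedly merges the two clusters with the largest mutual confusion; the text does not specify the exact hierarchical algorithm, so the function names, the merge criterion, and the toy confusion counts below are all hypothetical.

```python
def cluster_subwords(labels, confusion, num_classes):
    """Greedy agglomerative clustering driven by a confusion matrix.

    confusion[i][j] is an assumed count of how often subword i was
    recognized as subword j. Each step merges the two clusters with the
    largest mutual confusion, which removes the most inter-cluster confusion.
    """
    clusters = [[i] for i in range(len(labels))]

    def between(a, b):
        # Total confusion in both directions between two clusters.
        return sum(confusion[i][j] + confusion[j][i] for i in a for j in b)

    while len(clusters) > num_classes:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = between(clusters[x], clusters[y])
                if best is None or score > best[0]:
                    best = (score, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [[labels[i] for i in c] for c in clusters]


def anti_class_training_segments(segments, subword_class):
    """Keep (label, data) segments whose label is NOT in the given class."""
    members = set(subword_class)
    return [seg for seg in segments if seg[0] not in members]


# Toy example: two nasals and two vowels that are mutually confusable.
labels = ["m", "n", "iy", "ih"]
confusion = [
    [50, 9, 0, 1],
    [8, 50, 1, 0],
    [0, 1, 50, 7],
    [1, 0, 6, 50],
]
classes = cluster_subwords(labels, confusion, num_classes=2)
```

With these toy counts the nasals end up in one class and the vowels in the other; the segments selected by `anti_class_training_segments` for a class are the ones that would train that class's anti-subword model.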
ABSTRACT :

Speaker-specified hints are used to establish conditions for a speech
recognition system to select a recognition result for a previously provided 
utterance from among various possible homophones. The hints may characterize 
the utterance by a linguistic property, such as an orthographic, morphological, 
or semantic property. 

25 Claims, 6 Drawing figures 

Exemplary Claim Number: 1 

Number of Drawing Sheets : 6 
Detailed Description Text - DETX (4) :

The raw spectral information obtained from the front end circuitry 20 is 
further preprocessed in the computer 23 to replace each sample or input frame 
with an index which corresponds to or identifies one of a predetermined set of 
standard or prototype spectral distributions or frames. In the particular 
embodiment being described, 1024 such standard frames are utilized. In the 
art, this substitution is conventionally referred to as vector quantization and 
the indices are commonly referred to as VQ indices. The preprocessing of the
input data by the computer 23 also includes estimating the beginning and
end of a word or continuous phrase in an unknown speech input segment, e.g.,
based on the energy level values. For this purpose, the input circuitry may 
incorporate a software adjustable control parameter, designated the 
"sensitivity" value, which sets a threshold distinguishing user speech from 
background noise. 
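
As a rough illustration of the two preprocessing steps just described, the sketch below maps each frame to the index of its nearest prototype and estimates word endpoints from frame energies. The squared Euclidean distance, the three-entry codebook (the embodiment uses 1024 standard frames), and the function names are assumptions made for the example only.

```python
def quantize(frame, codebook):
    """Return the index of the prototype frame nearest to `frame`.

    Squared Euclidean distance is an assumed similarity measure.
    """
    best_index, best_dist = 0, float("inf")
    for index, proto in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(frame, proto))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def find_endpoints(energies, sensitivity):
    """Return (first, last) frame indices whose energy exceeds the
    sensitivity threshold, or None if no frame does."""
    active = [i for i, e in enumerate(energies) if e > sensitivity]
    if not active:
        return None
    return active[0], active[-1]


# Toy codebook with three prototype frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
vq_indices = [quantize(f, codebook) for f in ([0.9, 1.2], [3.0, 3.5])]
endpoints = find_endpoints([0.1, 0.2, 0.9, 1.5, 0.8, 0.1], sensitivity=0.5)
```

Raising `sensitivity` makes the detector ignore quieter frames, which corresponds to the adjustable threshold separating user speech from background noise.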



Detailed Description Text - DETX (5) : 

Vocabulary models are represented by sequences of standard or prototype 
states, which are represented by indices. Rather than representing spectral
distributions, the state indices identify or correspond to probability 
distribution functions. The state spectral index essentially serves as a 
pointer into a table which identifies, for each state index, the set of 
probabilities that each prototype frame or VQ index will be observed to 
correspond to that state index. The table is, in effect, a precalculated
mapping between all possible frame indices and all state indices. Thus, for
comparing a single frame and single state, a distance measurement or a measure 
of match can be obtained by directly indexing into the tables using the 
respective indices and combining the values obtained with appropriate 
weighting. It is thus possible to build a table or array storing a distance 
metric representing the closeness of match of each standard or prototype input 
frame with each standard or prototype model state. The distance or likelihood 
values which fill the tables can be generated by statistical training methods. 
A preferred system for precalculating and storing a table of distance 
measurements is disclosed in U.S. Pat. No. 5,546,499. The disclosure of that 
application is incorporated herein by reference. 
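
The table lookup described in this paragraph might be sketched as follows. The example assumes a single table whose entries are negative log probabilities, with a small floor to avoid taking the logarithm of zero; the referenced patent's actual table layout and weighting are not reproduced here, and all names and values are hypothetical.

```python
import math


def build_distance_table(state_probs, floor=1e-6):
    """Precalculate a distance for every (state index, VQ index) pair.

    state_probs[s][v] is an assumed probability of observing VQ index v
    in state s. The negative log probability is used as the distance,
    so a likelier pairing yields a smaller value.
    """
    return [[-math.log(max(p, floor)) for p in row] for row in state_probs]


def frame_state_distance(table, state_index, vq_index):
    """Compare one frame with one state by direct indexing."""
    return table[state_index][vq_index]


# Two states, three VQ indices (toy values).
state_probs = [
    [0.5, 0.25, 0.25],
    [0.0, 0.5, 0.5],
]
table = build_distance_table(state_probs)
```

Because the table is filled once in advance, comparing a frame with a state at recognition time costs only two index operations.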



Detailed Description Text - DETX (7) : 

In isolated word recognition, the sequence of frames which constitute the 
unknown speech segment taken together with a sequence of states representing a 
vocabulary model in effect define a matrix and the time warping process 
involves finding a path across the matrix which produces the best score, e.g., 
least distance or cost. The distance or cost is typically arrived at by 
accumulating the cost or distance values associated with each pairing of frame 
index with state index as described previously with respect to the VQ (vector 
quantization) process. An isolated word speech recognition system will 
typically identify the best scoring model and may also identify a ranked list 
of possible alternates. 
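
A minimal sketch of the path search is given below, assuming a simple left-to-right warping rule in which each new frame either stays in the current state or advances by one state. The allowed transitions, the function names, and the toy distance table are assumptions; the text only states that the best-scoring path across the matrix is sought.

```python
def warp_cost(frame_indices, state_indices, table):
    """Least accumulated distance of a path across the frame/state matrix.

    table[state][vq] is a precalculated frame-to-state distance. The path
    starts in the first state, ends in the last state, and at each frame
    either stays in the same state or moves to the next one.
    Assumes at least one frame and one state.
    """
    inf = float("inf")
    prev = [inf] * len(state_indices)
    prev[0] = table[state_indices[0]][frame_indices[0]]
    for vq in frame_indices[1:]:
        cur = [inf] * len(state_indices)
        for s, state in enumerate(state_indices):
            best_prev = prev[s] if s == 0 else min(prev[s], prev[s - 1])
            if best_prev < inf:
                cur[s] = best_prev + table[state][vq]
        prev = cur
    return prev[-1]


def rank_models(frame_indices, models, table):
    """Return model names ordered from best (lowest cost) to worst."""
    scored = [(warp_cost(frame_indices, states, table), name)
              for name, states in models.items()]
    return [name for _, name in sorted(scored)]


# Toy table: state 0 matches VQ index 0, state 1 matches VQ index 1.
table = [[0.0, 5.0], [5.0, 0.0]]
frames = [0, 0, 1, 1]
models = {"ab": [0, 1], "ba": [1, 0]}
ranking = rank_models(frames, models, table)
```

The first entry of `ranking` plays the role of the best scoring model, and the remaining entries form the ranked list of alternates.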



Detailed Description Text - DETX (33) : 

The alternative filtering may be achieved in various manners. In one 
approach, all alternatives which lack the characters described in X are 
removed. Some pre-processing may be done, for example, "double L" would be 
replaced by "LL", then all alternatives not containing "LL" are removed. 
Another filtering approach exploits the fact that many hints are related to 
verb endings ("sent with a t"). Accordingly, the system may check whether the
last letter(s) of the verb correspond to X. In this manner, X can be restricted
to commonly confusable verb endings (e.g., d, t for English; e, s, es, t, ent
for French). In another filtering approach, identifiers in a dictionary may be
utilized to show to which letter a hint applies, if present (an index to a
start position in the word string would suffice). For example, to
differentiate KANJI characters, the hint may be stored in the dictionary entry
for a word, such as in a field indicating the number of strokes in the
character.
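
The first two filtering approaches can be sketched as below. The helper names and the case-insensitive matching are assumptions introduced for illustration; only the "double L" to "LL" replacement and the two removal rules come from the description.

```python
def normalize_hint(hint):
    """Pre-process a hint: 'double L' becomes 'LL'; other hints are
    upper-cased unchanged."""
    words = hint.strip().split()
    if len(words) == 2 and words[0].lower() == "double" and len(words[1]) == 1:
        return words[1].upper() * 2
    return hint.strip().upper()


def filter_containing(alternatives, hint):
    """Remove all alternatives that lack the characters described by the hint."""
    x = normalize_hint(hint)
    return [w for w in alternatives if x in w.upper()]


def filter_by_ending(alternatives, hint):
    """Keep only alternatives whose last letter(s) correspond to the hint."""
    x = normalize_hint(hint)
    return [w for w in alternatives if w.upper().endswith(x)]


kept_double = filter_containing(["bal", "ball", "bawl"], "double L")
kept_ending = filter_by_ending(["sent", "send"], "t")
```

Here `kept_double` retains only "ball" and `kept_ending` retains only "sent", mirroring the "sent with a t" hint.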



Detailed Description Text - DETX (35) : 

A preferred embodiment also has language model and grammar implications. In 
speech recognition, a word or a command can only be recognized if it is part of 
a grammar of a language model. This also applies to the hints as used in a 
preferred embodiment. Different options are possible to add hints to a 
language model. For example, the hint phrase "spelled with" may be modeled in 
the same way as a "capitalize that" command. That is, the hint can occur at 
any point in the dictation, after any word. This can be modeled by giving the 
hint a unigram occurrence probability. The value of the probability should be 
in line with the probability assigned to other commands such as "capitalize 
that". Alternatively, "spelled with" may be constrained to occurring only 
after certain classes of confusable words; for example, only after verbs. 
ABSTRACT :



A continuous speech prefiltering system for use in continuous speech 
recognition computer systems. The speech to be recognized is converted from 
utterances to frame data sets, which frame data sets are smoothed to generate a 
smooth frame model over a predetermined number of frames. A resident 
vocabulary is stored within the computer as clusters of word models which are 
acoustically similar over a succession of frame periods. A cluster score is 
generated by the system, which score includes the likelihood of the smooth 
frames evaluated using a probability model for the cluster against which the 
smooth frame model is being compared. Cluster sets having cluster scores below 
a predetermined acoustic threshold are removed from further consideration. The 
remaining cluster sets are unpacked for determination of a word score for each 
unpacked word. These word scores are used to identify those words which are 
above a second predetermined threshold to define a word list which is sent to a 
recognizer for a more lengthy word match. A controller enables the system to 
initialize times corresponding to the frame start time for each frame data set, 
defining a sliding window. 
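
The prefiltering flow in this abstract can be outlined with a short sketch. Smoothing by averaging blocks of consecutive frames, passing the scoring models in as caller-supplied functions, and the dictionary layout of a cluster are all assumptions; the abstract does not define these details.

```python
def smooth(frames, window):
    """Average each block of `window` consecutive frames into one smooth frame."""
    smoothed = []
    for start in range(0, len(frames) - window + 1, window):
        block = frames[start:start + window]
        smoothed.append([sum(col) / window for col in zip(*block)])
    return smoothed


def prefilter(smooth_frames, clusters, cluster_score, word_score,
              cluster_threshold, word_threshold):
    """Build the word list passed on to the recognizer.

    Clusters scoring below `cluster_threshold` are dropped; the remaining
    clusters are unpacked and only words scoring above `word_threshold`
    are kept. Both scoring functions are hypothetical callables.
    """
    word_list = []
    for cluster in clusters:
        if cluster_score(smooth_frames, cluster) < cluster_threshold:
            continue
        for word in cluster["words"]:
            if word_score(smooth_frames, word) > word_threshold:
                word_list.append(word)
    return word_list


# Toy data with fixed scores standing in for real probability models.
smooth_frames = smooth([[1, 3], [3, 5], [10, 10], [20, 30]], window=2)
clusters = [{"name": "c1", "words": ["bat", "pat"]},
            {"name": "c2", "words": ["zoo"]}]
cluster_scores = {"c1": 0.8, "c2": 0.1}
word_scores = {"bat": 0.9, "pat": 0.3, "zoo": 0.95}
word_list = prefilter(
    smooth_frames, clusters,
    lambda frames, c: cluster_scores[c["name"]],
    lambda frames, w: word_scores[w],
    cluster_threshold=0.5, word_threshold=0.5,
)
```

In this toy run the word "zoo" is never scored individually because its whole cluster is rejected first, which is the saving the two-level scheme is meant to provide.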



27 Claims, 3 Drawing figures 
Exemplary Claim Number: 22 
Number of Drawing Sheets: 2 
Brief Summary Text - BSTX (11) :

Continuous speech computational requirements are even greater. In 
continuous speech, the type which humans normally speak, words are run
together, without pauses or other simple cues to indicate where one word ends 
and the next begins. When a mechanical speech recognition system attempts to 
recognize continuous speech, it initially has no way of identifying those 
portions of speech which correspond to individual words. Speakers of English 
apply a host of duration and coarticulation rules when combining phonemes into 
words and sentences, employing the same rules in recognizing spoken language. 
A speaker of English, given a phonemic spelling of an unfamiliar word from a 
dictionary, can pronounce the word recognizably or recognize the word when it 
is spoken. On the other hand, it is impossible to put together an "alphabet" 
of recorded phonemes which, when concatenated, will sound like natural English
words. It comes as a surprise to most speakers, for example, to discover that
the vowels in "will" and "kick", which are identical according to dictionary
pronunciations, are as different in their spectral characteristics as the
vowels in "not" and "nut", or that the vowel in "size" has more than twice the 
duration of the same vowel in "seismograph". 



Detailed Description Text - DETX (3) : 

FIG. 1 is a general flow diagram showing the flow of information or data of 
the present invention. As shown, Phase I involves the flow of data from the 
user, in the form of utterances UT, through a series of transformers into 
transform data TR. The transform data is concurrently sent to a recognizer R 
and a processing, or pre-filter system PF. While the recognizer R processes 
the transform data TR, it queries the pre-filter system PF for data. Phase II 
involves the flow of transform data TR to the pre-filter system PF. Phase III 
then involves data flow of pre-filter data to a recognizer R upon query by the 
recognizer R. User U receives recognizer data in the form of a monitor word 
display on a monitor M. Each phase will separately be discussed below. The 
system of the present invention is used during Phase II for converting 
transform data into pre-filter data which is then sent for more lengthy 
filtering at a recognizer (Phase III).
A system is provided for allowing a user to add word models
to a speech recognition system. In particular, the system
allows a user to input a number of renditions of the new
word and generates from these a sequence of phonemes
representative of the new word. This representative
sequence of phonemes is stored in a word to phoneme
dictionary together with the typed version of the word for
subsequent use by the speech recognition system.



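[Figure: block diagram with blocks labeled PHONEME MODELS, PRE-PROCESSOR, USER INTERFACE, SPEECH RECOGNITION ENGINE, WORD DECODER, CONTROL UNIT, WORD TO PHONEME DICTIONARY, and WORD MODEL GENERATION UNIT; reference numerals 27 and 29 appear in the drawing.]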

07/17/2003, EAST version: 1.04.0000 



